[NV TRT RTX EP] Leverage ORT allocator for workspace allocations #25564

gedoensmax wants to merge 5 commits into microsoft:main
Conversation
@jywu-msft This PR is a blocker for running LLMs with the TRT-RTX EP. Can you please merge this?

@jywu-msft We would also like to have this for WinML GA. Could you please help cherry-pick it into the right branch?
Pull Request Overview
This PR leverages the ORT allocator for workspace allocations in the NVIDIA TensorRT RTX execution provider, significantly reducing memory usage for models with wide dynamic shape ranges. The change removes the previous context memory sharing mechanism and replaces it with dynamic allocation using ORT's allocator infrastructure.
Key changes include:
- Removal of the `context_memory_sharing_enable` configuration option and related infrastructure
- Implementation of dynamic context memory allocation using the ORT allocator with per-context memory management
- Addition of utility functions to detect dynamic shapes in TensorRT tensors
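A minimal sketch of what such a dynamic-shape check might look like (the PR's actual helpers live in nv_execution_provider_utils.h and may differ; `HasDynamicShape` is a hypothetical name), relying on the TensorRT convention that unknown dimensions are reported as -1:

```cpp
#include <NvInfer.h>

// Sketch only: returns true if any dimension of the named tensor is dynamic.
// TensorRT reports unknown dimensions as -1 in the returned Dims.
inline bool HasDynamicShape(const nvinfer1::ICudaEngine& engine, const char* tensor_name) {
  const nvinfer1::Dims dims = engine.getTensorShape(tensor_name);
  for (int32_t i = 0; i < dims.nbDims; ++i) {
    if (dims.d[i] == -1) {
      return true;
    }
  }
  return false;
}
```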
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| nv_basic_test.cc | Updated test configuration and corrected model filename for AutoEP test |
| nv_execution_provider_utils.h | Added utility functions for detecting dynamic shapes in TensorRT tensors |
| nv_execution_provider_info.h | Removed context_memory_sharing_enable configuration option |
| nv_execution_provider.h | Updated OutputAllocator to use ORT allocator and modified state structures for dynamic memory management |
| nv_execution_provider.cc | Implemented dynamic context memory allocation logic and removed static memory sharing code |
```cpp
class OutputAllocator : public nvinfer1::IOutputAllocator {
 public:
  OutputAllocator() = delete;
  OutputAllocator(OrtAllocator* allocator) : alloc_(allocator) {};
```
The semicolon after the closing brace is unnecessary for constructor definitions. Remove the semicolon.
```diff
- OutputAllocator(OrtAllocator* allocator) : alloc_(allocator) {};
+ OutputAllocator(OrtAllocator* allocator) : alloc_(allocator) {}
```
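For context, `nvinfer1::IOutputAllocator` is TensorRT's callback interface for sizing output buffers once shapes are known at runtime. A hedged sketch of how the ORT-backed override might look; `buffer_` and `allocated_size_` are hypothetical members, and this is not the PR's verbatim code:

```cpp
// Route TensorRT's output (re)allocation request through the OrtAllocator
// handle stored in alloc_. OrtAllocator is a C struct of function pointers,
// hence the explicit alloc_ first argument.
void* reallocateOutput(char const* tensor_name, void* current_memory,
                       uint64_t size, uint64_t alignment) noexcept override {
  if (size > allocated_size_) {
    if (buffer_ != nullptr) {
      alloc_->Free(alloc_, buffer_);
    }
    buffer_ = alloc_->Alloc(alloc_, size);
    allocated_size_ = (buffer_ != nullptr) ? size : 0;
  }
  return buffer_;  // TensorRT writes the output tensor into this buffer
}
```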
```cpp
if (trt_state->context_memory_size != mem_size) {
  LOGS_DEFAULT(INFO) << "[NvTensorRTRTX EP] A new context memory was allocated with size " << mem_size;
  trt_state->context_memory = IAllocator::MakeUniquePtrFromOrtAllocator<void>(alloc, mem_size, false /*use_reserve*/);
  // trt_state->context_memory = IAllocator::MakeUniquePtr<void>(alloc, mem_size, false /*use_reserve*/, stream);
```
This commented-out line should be removed as it appears to be leftover debug/alternative implementation code.
```diff
- // trt_state->context_memory = IAllocator::MakeUniquePtr<void>(alloc, mem_size, false /*use_reserve*/, stream);
```
I want to keep this as a TODO for an improvement coming soon that uses `AllocOnStream`.
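To connect the dots on how the block above is used per inference: the workspace is sized for the current shapes and then bound to the execution context. A hedged sketch assuming TensorRT 10 APIs (`updateDeviceMemorySizeForShapes`/`setDeviceMemoryV2`); the PR's actual call sites may differ:

```cpp
// Ask TensorRT how much device memory the context needs for the input
// shapes that were just set, rather than the profile-wide maximum.
const int64_t mem_size = trt_context->updateDeviceMemorySizeForShapes();
if (trt_state->context_memory_size != static_cast<size_t>(mem_size)) {
  trt_state->context_memory =
      IAllocator::MakeUniquePtrFromOrtAllocator<void>(alloc, mem_size, false /*use_reserve*/);
  trt_state->context_memory_size = mem_size;
}
// Hand the ORT-allocated workspace to the execution context.
trt_context->setDeviceMemoryV2(trt_state->context_memory.get(), mem_size);
```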
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).

The Windows runners seem to be stuck in the setup phase.

Restarted them.
@jywu-msft Since this has been somewhat delayed and has kept us from opening more branches that build on top of these changes, we came up with a cumulative merge branch: #25656.

Could you help update the PR description for that cumulative merge branch? I will review it.

@chilo-ms I updated the description and left some more comments.

Closing this PR since it's duplicated in #25656.
Description
This PR leverages the OrtAllocator for the intermediate workspace required to execute the TRT engine. With this change we significantly reduce memory usage for models with wide dynamic shape ranges, as observed with ORT GenAI.
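To illustrate where the savings come from (assumed TensorRT 10 APIs; the comparison is conceptual, not the PR's code):

```cpp
// Conceptually, a statically sized workspace covers the largest shapes the
// engine's optimization profile allows:
int64_t worst_case = engine->getDeviceMemorySizeV2();

// Sizing per inference only covers the shapes actually bound on this run:
int64_t actual = context->updateDeviceMemorySizeForShapes();

// For LLM workloads with wide dynamic ranges (e.g. sequence lengths from 1 to
// tens of thousands), actual is often a small fraction of worst_case.
```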
@jywu-msft @chilo-ms From our side, reviews on this are done.